
Conversation

@wking (Member) commented Nov 27, 2018

This reverts commit 6dc1bf6, #615.

109531f (cmd/openshift-install/create: Retry watch connections, 2018-11-02, #606) made the watch re-connects reliable, so make watch timeouts fatal again. This avoids confusing users by showing "Install complete!" messages when they may actually have a hung bootstrap process.
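
(For context, a minimal standalone sketch of the kind of reconnecting watch that #606 introduced, based on client-go's RetryWatcher, whose "Re-creating the watcher" warning shows up in the test log below; the function name and wiring are illustrative, assuming 2018-era client-go signatures, not the installer's exact code.)

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// Sketch: re-establish a dropped event watch from the last-seen
// resource version instead of failing the wait outright.
func watchEvents(events corev1client.EventInterface, sinceRV string) (watch.Interface, error) {
	return watchtools.NewRetryWatcher(sinceRV, &cache.ListWatch{
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return events.Watch(options)
		},
	})
}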
@openshift-ci-robot openshift-ci-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Nov 27, 2018
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 27, 2018
In cmd/openshift-install/create.go:

 if err != nil {
-	logrus.Error(errors.Wrap(err, "waiting for bootstrap-complete"))
-	return nil
+	return errors.Wrap(err, "waiting for bootstrap-complete")
@abhinavdahiya (Contributor) commented on the diff:

To compare, here is how the error would have been structured without Wrap:

fmt.Errorf("failed waiting for bootstrap-complete: %v", err)

I would suggest we keep it as errors.Wrap(err, "failed waiting for bootstrap-complete").
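
(For reference, a minimal standalone comparison of the two forms, not from the PR; with github.com/pkg/errors both print the same "message: cause" string, but Wrap preserves the cause chain.)

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

func main() {
	cause := fmt.Errorf("watch closed before UntilWithoutRetry timeout")

	// errors.Wrap prints "message: cause" and keeps the cause
	// recoverable via errors.Cause
	fmt.Println(errors.Wrap(cause, "waiting for bootstrap-complete"))

	// fmt.Errorf yields the same flat string but discards the chain
	fmt.Println(fmt.Errorf("failed waiting for bootstrap-complete: %v", cause))
}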

@wking (Member, Author) replied:

$ git describe --dirty
v0.4.0-25-gb13b7fa-dirty
$ git diff
diff --git a/cmd/openshift-install/create.go b/cmd/openshift-install/create.go
index 5ba6517..8c95b8c 100644
--- a/cmd/openshift-install/create.go
+++ b/cmd/openshift-install/create.go
@@ -203,8 +203,8 @@ func destroyBootstrap(ctx context.Context, directory string) (err error) {
 				watcher, err := events.Watch(metav1.ListOptions{
 					ResourceVersion: sinceResourceVersion,
 				})
-				if err == nil {
-					return watcher, nil
+				if err == nil || true {
+					return watcher, err
 				}
 				select {
 				case <-eventContext.Done():
$ TAGS=libvirt_destroy hack/build.sh
$ openshift-install --dir=wking create cluster
INFO Fetching OS image...
INFO Using Terraform to create cluster...
INFO Waiting for bootstrap completion...
INFO API v1.11.0+d4cacc0 up
WARNING RetryWatcher - getting event failed! Re-creating the watcher. Last RV: 2214 
FATAL Error executing openshift-install: waiting for bootstrap-complete: watch closed before UntilWithoutRetry timeout 

That line includes both "FATAL" and "Error" already. Personally I think that's enough without also including an additional "failed", but I can add it if you like.
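
(For reference, a minimal standalone sketch, assuming a logrus-based top-level handler roughly like the installer's, of how the returned error becomes that line; the "Error executing openshift-install" prefix actually comes from the installer's command runner, and the names here are illustrative, not installer code.)

package main

import (
	"github.com/pkg/errors"
	"github.com/sirupsen/logrus"
)

// illustrative stand-in for the bootstrap wait, not installer code
func waitForBootstrapComplete() error {
	err := errors.New("watch closed before UntilWithoutRetry timeout")
	return errors.Wrap(err, "waiting for bootstrap-complete")
}

func main() {
	if err := waitForBootstrapComplete(); err != nil {
		// with the installer's log formatting this renders as:
		//   FATAL Error executing openshift-install: waiting for bootstrap-complete: watch closed before UntilWithoutRetry timeout
		logrus.Fatal(errors.Wrap(err, "Error executing openshift-install"))
	}
}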

@abhinavdahiya (Contributor) replied Nov 28, 2018:

Since the error is returned from this function, the first form is clearer when reading the function's return site (not a few levels up).

@wking (Member, Author) commented Nov 27, 2018

VPC exhaustion:

level=error msg="Error: Error applying plan:\n\n1 error(s) occurred:\n\n* module.vpc.aws_vpc.new_vpc: 1 error(s) occurred:\n\n* aws_vpc.new_vpc: Error creating VPC: VpcLimitExceeded: The maximum number of VPCs has been reached.\n\tstatus code: 400, request id: d1fea395

I've since cleaned up the leaked CI VPCs, and #740 is in flight to address the current leak.

/retest

@wking (Member, Author) commented Nov 27, 2018

e2e-aws:

level=error msg="Error: Error applying plan:\n\n1 error(s) occurred:\n\n* aws_route53_record.etcd_a_nodes[2]: 1 error(s) occurred:\n\n* aws_route53_record.etcd_a_nodes.2: timeout while waiting for state to become 'INSYNC' (timeout: 30m0s)\n\nTerraform does not automatically rollback in the face of errors.\nInstead, your Terraform state file has been partially updated with\nany resources that successfully completed. Please address the error\nabove and apply again to incrementally change your infrastructure."
level=fatal msg="Error executing openshift-install: failed to fetch Cluster: failed to generate asset \"Cluster\": failed to run terraform: failed to execute Terraform: exit status 1"

I'll wait until CI settles down and kick this again tonight.

@abhinavdahiya (Contributor) commented:
don't care enough about #741 (comment) (just preferred the recommendation)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 28, 2018
@openshift-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 6971ab6 into openshift:master Nov 28, 2018
@wking wking deleted the fatal-wait-failure branch November 28, 2018 05:20
